Last Update: 12.05.2024
Updated the majority of plots to use ggplot2 and its friends rather than base R, reorganized headers and subheaders, changed the HTML theme to spacelab, and added Section 4.
# reproducability
set.seed(5110345)
# no scientific notation
options(scipen = 999)
The English Premier League (EPL) is widely regarded as the world’s most prominent professional men’s soccer league. Founded in 1992 when top English clubs broke away from the Football League First Division, the EPL has become a global phenomenon. Renowned for its lucrative TV broadcasting deals and its showcase of world-class talent, the EPL has attracted massive investment from both club owners and commercial interests in recent years. A notable example is the Abu Dhabi-based ownership of Manchester City, which has transformed the club into a dominant force in the league. This increase in investment has led to a dramatic rise in player valuations, with clubs willing to spend record-breaking sums to secure top talent. This spending takes place in the transfer market, where players are bought and sold between clubs.
The transfer market is open during specific windows: the summer months of June, July, and August, along with a mid-season window in January. During these periods, clubs negotiate transfer fees which represent the agreed amount for one club to acquire a player from another. Transfer fees are influenced by factors such as player performance, contract length, and market demand.
As a result of high levels of investment, the EPL’s transfer market has become one of the most competitive and volatile in the world. Player values can fluctuate significantly based on a combination of on-field form, media narratives, and external market forces, such as the interests of wealthy investors or the potential for future success in international tournaments. Clubs’ heavy investment in the market has not only intensified the level of competition but also increased the financial stakes, making player valuations one of the most dynamic aspects of the league.
The dataset used for this project was created using the following websites:
The dataset contains data on 241 EPL attackers and midfielders, along with their performance metrics for the 2023-2024 EPL season. Additionally, the dataset contains the transfer market valuation of each player (the target variable). Here are the variables included in the dataset:
Player: Player name
Nation: Nationality of the player
Pos: Position most commonly played by the player (categorical, 1 if forward and 0 if midfielder)
Squad: Squad of the player
Age: Age at the start of the season
Born: Year of birth
MP: Number of matches played (out of 38 possible)
Starts: Number of matches started (out of 38 possible)
Min: Total minutes played
90s: Total minutes played divided by 90
Gls: Total goals scored
Ast: Total assists provided
G+A: Sum of Gls and Ast
G-PK: Non-penalty goals
PK: Penalty goals scored
PKatt: Penalty kicks attempted
CrdY: Total yellow cards
CrdR: Total red cards
These metrics are calculated using historical data and probabilistic models to assess the quality of a player’s chance or contributions. These metrics provide a more accurate representation of a player’s true performance potential.
For example, a shot with an xG of 0.43 had a 43% chance of being scored given the shooter’s position and the context of the shot; a player’s season xG is the sum of these values across all of their shots.
All of these are provided by Opta, one of the official statistical partners of the EPL.
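Under this definition, a player's season-level expected goals can be written as the sum of the scoring probabilities assigned to each of their shots:

```latex
\mathrm{xG}_{\text{player}} \;=\; \sum_{s \,\in\, \text{shots}} \Pr(\text{goal} \mid s)
```

with npxG defined the same way over non-penalty shots only.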
xG: Expected goals scored
npxG: Non-penalty expected goals scored
xAG: Expected assists provided
npxG+xAG: Sum of npxG and xAG
PrgC: A progressive carry occurs when a player moves the ball forward at least 10 yards or into the final third of the pitch while dribbling, measuring how effectively a player can carry the ball and advance play
PrgP: A progressive pass is a pass that moves the ball at least 10 yards forward or into the attacking third, reflecting a player’s ability to make attacking passes that drive the play forward
PrgR: Tracks how often a player receives a pass that advances the ball by at least 10 yards or into the attacking third, showing how involved a player is in progressing the play when receiving the ball
Trnsfr: Transfer market value of a player, expressed in millions of pounds (£)

This dataset is likely independently and identically distributed (i.i.d.) since it includes only players who participated in the 2023-2024 Premier League season, and each player appears exactly once. Additionally, individual player performances are assumed to be unaffected by the performance of others, supporting the independence of observations.
The goal of this analysis is to understand the relationship between player performance metrics and transfer market value in the English Premier League. As mentioned, transfer fees are often influenced by external factors such as contract duration and social media presence. This analysis narrows the focus to mainly on-field performance metrics such as goals scored, assists provided, and progressive carries. By using statistical modeling techniques, the project will explore how various metrics contribute to a player’s valuation. Furthermore, challenges such as high collinearity, variable selection, and assumption validation will be addressed to ensure a robust and interpretable model.
library(readxl) # loading dataset
library(dplyr) # subsetting
library(ggplot2) # plotting
library(reshape2) # plotting
library(GGally) # plotting
library(cowplot) # plotting
library(knitr) # tables
library(psych) # summary statistics
library(faraway) # VIF
library(lmtest) # BP Test
library(sandwich) # robust statistics
# load dataset
soccer = read_xlsx("epl.dataset.xlsx")
# big six vector
big_six = c("Arsenal",
"Chelsea",
"Liverpool",
"Manchester City",
"Manchester Utd",
"Tottenham")
# create `six` col
soccer$Six = ifelse(soccer$Squad %in% big_six, 1, 0)
We create a variable named Six, which is an indicator
variable indicating whether or not a player plays for a “Big Six” club.
The “Big Six” clubs in England are the six most successful and affluent
clubs in EPL history. They are:
Arsenal F.C. (Football Club)
Chelsea F.C.
Liverpool F.C.
Manchester City
Manchester United
Tottenham Hotspur F.C.
The reason we do this is that Big Six players tend to command above-average transfer market values, so membership in one of these clubs may itself influence a player’s transfer market value.
# coerce `Nation` col into factor
soccer$Nation = as.factor(soccer$Nation)
# all of the factor levels
levels(soccer$Nation)
## [1] "al ALB" "ar ARG" "at AUT" "be BEL" "bf BFA" "br BRA" "cd COD"
## [8] "ch SUI" "ci CIV" "cl CHI" "cm CMR" "co COL" "cz CZE" "de GER"
## [15] "dk DEN" "dz ALG" "ec ECU" "eg EGY" "eng ENG" "es ESP" "fr FRA"
## [22] "ga GAB" "gd GRN" "gh GHA" "gw GNB" "hr CRO" "hu HUN" "ie IRL"
## [29] "is ISL" "it ITA" "jm JAM" "jp JPN" "kr KOR" "ma MAR" "ml MLI"
## [36] "mx MEX" "ng NGA" "nir NIR" "nl NED" "no NOR" "nz NZL" "pl POL"
## [43] "pt POR" "py PAR" "rs SRB" "sct SCO" "se SWE" "sn SEN" "tn TUN"
## [50] "tr TUR" "ua UKR" "uy URU" "wls WAL" "za RSA" "zw ZIM"
We coerce the Nation variable into a factor, with each
level of the factor representing a country.
Note that while factor variables typically have fewer levels to maintain model interpretability, the country of origin could be a significant factor influencing transfer market value.
The goal of the exploratory data analysis (EDA) is to gain insights into the relationships between the response variable and some of the most linearly correlated variables. This will be accomplished using visualizations and summary statistics.
# select numeric cols
numeric_data = soccer %>% select_if(is.numeric)
# calculate correlation matrix
cor_matrix = cor(numeric_data)
# reshape cor_matrix into long format
cor_long = melt(cor_matrix)
# create corrplot using ggplot2
ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white", size = 0.5) + # borders to tiles
scale_fill_gradient2(low = "dodgerblue",
high = "darkred",
mid = "white",
midpoint = 0,
name = "Correlation") +
geom_text(aes(label = sprintf("%.2f", value)),
color = "black", size = 2) + # add correlation vals
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1), # rotates x-axis labels
axis.text.y = element_text(hjust = 1), # aligns y-axis labels
panel.grid = element_blank(), # removes grid lines
axis.title = element_blank())
numeric_data selects the numeric cols in the
soccer dataset
cor_matrix calculates a correlation matrix for the
cols in numeric_data
cor_long reshapes the matrix into a long format
using the function melt()
ggplot2 is used to create the correlation
plot
# define the response variable Trnsfr
response_variable = "Trnsfr"
# generate list of variables
predictors = setdiff(colnames(numeric_data), response_variable)
# loop
for (predictor in predictors) {
# create scatterplot
plot(numeric_data[[predictor]], numeric_data[[response_variable]],
main = paste("Scatterplot: ", predictor, "vs.", response_variable),
xlab = predictor, ylab = response_variable,
pch = 20, col = "dodgerblue")
# grid-lines
grid()
}
response_variable contains the string for the
response variable
predictors is a list of variables generated by using
setdiff(), which excludes the
response_variable from selection
The loop iterates through each of the predictor variables in
predictors and uses plot() to plot
scatterplots of each variable against the response
# boxplot of `Pos` on `Trnsfr`
ggplot(soccer, aes(x = factor(Pos), y = Trnsfr)) +
geom_boxplot(fill = "dodgerblue", color = "orange", width = 0.25) +
labs(title = "Boxplot: Pos on Trnsfr", x = "Pos (1 if forward, 0 else)", y = "Transfer Market Value") +
theme_minimal()
# barplot of `Pos`
ggplot(soccer, aes(x = factor(Pos))) +
geom_bar(fill = "dodgerblue", color = "orange", width = 0.25) +
labs(title = "Barplot: Pos", x = "Pos (1 if forward, 0 else)", y = "Count of Players") +
theme_minimal()
# boxplot of `Six` on `Trnsfr`
ggplot(soccer, aes(x = factor(Six), y = Trnsfr)) +
geom_boxplot(fill = "dodgerblue", color = "orange", width = 0.25) +
labs(title = "Boxplot: Six on Trnsfr", x = "Six (1 if plays for big six, 0 else)", y = "Transfer Market Value") +
theme_minimal()
# barplot of `Six`
ggplot(soccer, aes(x = factor(Six))) +
geom_bar(fill = "dodgerblue", color = "orange", width = 0.25) +
labs(title = "Barplot: Six", x = "Six (1 if plays for big six, 0 else)", y = "Count of Players") +
theme_minimal()
# barplot of `Nation`
ggplot(soccer, aes(x = factor(Nation))) +
geom_bar(fill = "dodgerblue", color = "orange") +
labs(title = "Barplot: Nation", x = "Nation (Factor)", y = "Count of Players") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# barplot of `Nation` on `Trnsfr`
ggplot(soccer, aes(x = factor(Nation), y = Trnsfr)) +
geom_bar(stat = "identity", fill = "dodgerblue", color = "orange") +
labs(title = "Barplot: Transfer Market Value by Nation", x = "Nation (Factor)", y = "Transfer Market Value") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Here we focus on the categorical and factor variables in the dataset
Boxplots are generated using the geom_boxplot()
function from ggplot2
Barplots are generated using the geom_bar() function
from ggplot2
The last barplot plots Nation against
Trnsfr using stat = "identity", so each bar’s height is the
summed transfer value of all players from that nation, showing the
difference in total transfer value by nation
# generate descriptive statistics
statistics = psych::describe(numeric_data)
# drop unnecessary cols
statistics = statistics %>%
select(-vars, -n, -mad, -skew, -kurtosis, -se)
# round for clarity
statistics = round(statistics, 5)
# col names
colnames(statistics) = c("Mean",
"StDev",
"Median",
"Trimmed Mean",
"Min",
"Max",
"Range")
# neat table
kable(statistics, booktabs = TRUE, align = "r")
psych is used to generate the descriptive
statistics
dplyr is used to filter the statistics and drop the
unnecessary ones such as skew and
kurtosis
round() is used for clarity and
colnames() renames the columns of the dataframe
kable() is used to generate the descriptive
statistics table
ggplot(numeric_data, aes(x = Trnsfr)) +
geom_histogram(binwidth = 10, fill = "dodgerblue", color = "orange", alpha = 1) +
labs(title = "Histogram: Trnsfr",
x = "Trnsfr",
y = "Frequency") +
theme_minimal()
ggplot2 is used to generate a simple histogram of the
response variable Trnsfr
# add the `Nation` column back
numeric_data = bind_cols(numeric_data, soccer %>% select(Nation))
numeric_data$Nation = as.factor(numeric_data$Nation)
bind_cols() adds Nation back to our numeric_data
dataframe as a factor
# sample size for BIC
n = nrow(numeric_data)
# fit full additive model
test = lm(Trnsfr ~ ., data = numeric_data) # . includes all predictors
# backwards AIC search
aic_back_mod = step(test, direction = "backward", trace = 0)
# backwards BIC search
bic_back_mod = step(test, direction = "backward", trace = 0, k = log(n))
# stepwise (both-direction) AIC search
aic_exh_mod = step(test, direction = "both", trace = 0)
# stepwise (both-direction) BIC search
bic_exh_mod = step(test, direction = "both", trace = 0, k = log(n))
aic_back_mod$call
## lm(formula = Trnsfr ~ Age + MP + Min + `90s` + Gls + Ast + `G-PK` +
## PKatt + xG + npxG + `npxG+xAG` + PrgP + PrgR + Six, data = numeric_data)
bic_back_mod$call
## lm(formula = Trnsfr ~ Age + MP + Ast + `G-PK` + PKatt + PrgP +
## Six, data = numeric_data)
aic_exh_mod$call
## lm(formula = Trnsfr ~ Age + MP + Min + `90s` + Gls + Ast + `G-PK` +
## PKatt + xG + npxG + `npxG+xAG` + PrgP + PrgR + Six, data = numeric_data)
bic_exh_mod$call
## lm(formula = Trnsfr ~ Age + MP + Ast + `G-PK` + PKatt + PrgP +
## Six, data = numeric_data)
n is the number of observations in the
numeric_data dataframe
test fits a model containing all of the variables
which will be used for the search procedures
aic_back_mod is the backwards AIC step search
model
bic_back_mod is the backwards BIC step search
model
aic_exh_mod is the both-direction (stepwise) AIC search
model
bic_exh_mod is the both-direction (stepwise) BIC search
model
Note we have trace = 0 for all searches in order to
prevent redundant outputs
Note the BIC searches have the argument k = log(n),
which tells the step() function to use BIC rather than
AIC
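The k = log(n) argument works because the two criteria differ only in how they penalize model size:

```latex
\mathrm{AIC} = -2\log\hat{L} + 2p, \qquad \mathrm{BIC} = -2\log\hat{L} + p\log(n)
```

Setting k = log(n) in step() swaps the default per-parameter penalty of 2 for log(n), which is larger whenever n > e² ≈ 7.4; this stricter penalty is why the BIC searches above select a smaller model than the AIC searches.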
result_mod = lm(formula = Trnsfr ~ Age + MP + Min + `90s` + Gls + Ast + `G-PK` +
PKatt + xG + npxG + `npxG+xAG` + PrgP + PrgR + Six, data = numeric_data)
vif(result_mod)
## Age MP Min `90s` Gls
## 1.094068 4.792636 118351.607948 118324.979371 907.291828
## Ast `G-PK` PKatt xG npxG
## 4.988636 711.244706 3769.590906 62074.496386 45479.378926
## `npxG+xAG` PrgP PrgR Six
## 36.931231 4.217821 3.713216 1.370008
result_mod is the model obtained from the
searches
We use the vif() function from the library
faraway to obtain the VIF for
result_mod
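Recall that the VIF for a predictor is computed from the R² of regressing that predictor on all of the other predictors:

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

A VIF above 100,000 for Min and `90s` therefore means each is almost perfectly linearly determined by the others (R_j² > 0.99999), which is unsurprising since `90s` is Min divided by 90.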
# subset for step vars
sub = subset(numeric_data, select = c(Trnsfr,
Age,
MP,
Min,
`90s`,
Gls,
Ast,
`G-PK`,
PKatt,
xG,
npxG,
`npxG+xAG`,
PrgP,
PrgR,
Six))
Min and 90s
# regress response on all preds except preds of interest
less_min_mod = lm(Trnsfr ~ . - Min, data = sub)
less_ninty_mod = lm(Trnsfr ~ . - `90s`, data = sub)
# regress preds of interest against all preds
min_mod = lm(Min ~ . - Trnsfr, data = sub)
ninty_mod = lm(`90s` ~ . - Trnsfr, data = sub)
# corr of response and min
cor(resid(less_min_mod), resid(min_mod))
## [1] 0.1305785
# corr of response and 90s
cor(resid(less_ninty_mod), resid(ninty_mod))
## [1] -0.1296478
90s
# drop 90s from the subset
sub = subset(sub, select = -`90s`)
90s
# without 90s
model_1 = lm(formula = Trnsfr ~ Age + MP + Min + Gls + Ast + `G-PK` +
PKatt + xG + npxG + `npxG+xAG` + PrgP + PrgR + Six, data = numeric_data)
vif(model_1)
## Age MP Min Gls Ast `G-PK`
## 1.060381 4.772855 8.163680 907.242010 4.977220 711.236503
## PKatt xG npxG `npxG+xAG` PrgP PrgR
## 3767.937623 62044.786587 45458.490578 36.898586 4.202575 3.701442
## Six
## 1.367369
xG and npxG
# regress response on all preds except preds of interest
less_xg_mod = lm(Trnsfr ~ . - xG, data = sub)
less_npxg_mod = lm(Trnsfr ~ . - npxG, data = sub)
# regress preds of interest against all preds
xg_mod = lm(xG ~ . - Trnsfr, data = sub)
npxg_mod = lm(npxG ~ . - Trnsfr, data = sub)
# corr of response and xG
cor(resid(less_xg_mod), resid(xg_mod))
## [1] -0.1728598
# corr of response and npxG
cor(resid(less_npxg_mod), resid(npxg_mod))
## [1] 0.1752576
xG
# drop xG from the subset
sub = subset(sub, select = -xG)
90s and xG
# without 90s, xG
model_2 = lm(formula = Trnsfr ~ Age + MP + Min + Gls + Ast + `G-PK` +
PKatt + npxG + `npxG+xAG` + PrgP + PrgR + Six, data = numeric_data)
vif(model_2)
## Age MP Min Gls Ast `G-PK` PKatt
## 1.060365 4.751707 8.161051 740.612741 4.973988 583.949962 48.550683
## npxG `npxG+xAG` PrgP PrgR Six
## 21.272951 36.880765 4.202216 3.648548 1.367119
Gls and G-PK
# regress response on all preds except preds of interest
less_gls_mod = lm(Trnsfr ~ . - Gls, data = sub)
less_gpk_mod = lm(Trnsfr ~ . - `G-PK`, data = sub)
# regress preds of interest against all preds
gls_mod = lm(Gls ~ . - Trnsfr, data = sub)
gpk_mod = lm(`G-PK` ~ . - Trnsfr, data = sub)
# corr of response and Gls
cor(resid(less_gls_mod), resid(gls_mod))
## [1] -0.03068311
# corr of response and G-PK
cor(resid(less_gpk_mod), resid(gpk_mod))
## [1] 0.05594432
Gls
# drop Gls from the subset
sub = subset(sub, select = -Gls)
90s, xG, Gls
# without 90s, xG, Gls
model_3 = lm(formula = Trnsfr ~ Age + MP + Min + Ast + `G-PK` +
PKatt + npxG + `npxG+xAG` + PrgP + PrgR + Six, data = numeric_data)
vif(model_3)
## Age MP Min Ast `G-PK` PKatt npxG
## 1.060226 4.751618 8.038287 4.810253 5.434749 1.647932 21.103024
## `npxG+xAG` PrgP PrgR Six
## 36.415338 4.132682 3.631521 1.366870
npxG and npxG+xAG
# regress response on all preds except preds of interest
less_npxg_mod = lm(Trnsfr ~ . - npxG, data = sub)
less_npxg_ag_mod = lm(Trnsfr ~ . - `npxG+xAG`, data = sub)
# regress preds of interest against all preds
npxg_mod = lm(npxG ~ . - Trnsfr, data = sub)
npxG_ag_mod = lm(`npxG+xAG` ~ . - Trnsfr, data = sub)
# corr of response and npxG
cor(resid(less_npxg_mod), resid(npxg_mod))
## [1] 0.1185239
# corr of response and npxG+xAG
cor(resid(less_npxg_ag_mod), resid(npxG_ag_mod))
## [1] -0.1210452
npxG
# drop npxG from the subset
sub = subset(sub, select = -npxG)
# without 90s, xG, Gls, npxG
model4 = lm(formula = Trnsfr ~ Age + MP + Min + Ast + `G-PK` +
PKatt + `npxG+xAG` + PrgP + PrgR + Six, data = numeric_data)
vif(model4)
## Age MP Min Ast `G-PK` PKatt `npxG+xAG`
## 1.059614 4.699315 8.036784 3.637705 4.590033 1.647739 12.230516
## PrgP PrgR Six
## 3.621714 3.119021 1.346061
G-PK and npxG+xAG
# regress response on all preds except preds of interest
less_gpk_mod = lm(Trnsfr ~ . - `G-PK`, data = sub)
less_npxg_ag_mod = lm(Trnsfr ~ . - `npxG+xAG`, data = sub)
# regress preds of interest against all preds
gpk_mod = lm(`G-PK` ~ . - Trnsfr, data = sub)
npxG_ag_mod = lm(`npxG+xAG` ~ . - Trnsfr, data = sub)
# corr of response and G-PK
cor(resid(less_gpk_mod), resid(gpk_mod))
## [1] 0.3185485
# corr of response and npxG+xAG
cor(resid(less_npxg_ag_mod), resid(npxG_ag_mod))
## [1] -0.04222297
npxG+xAG
# drop `npxG+xAG` from the subset
sub = subset(sub, select = -`npxG+xAG`)
# without 90s, xG, Gls, npxG, npxG+xAG
model5 = lm(formula = Trnsfr ~ Age + MP + Min + Ast + `G-PK` +
PKatt + PrgP + PrgR + Six, data = numeric_data)
vif(model5)
## Age MP Min Ast `G-PK` PKatt PrgP PrgR
## 1.054738 4.697207 7.738169 2.671767 2.468864 1.311474 3.542120 2.586803
## Six
## 1.335559
MP and Min
# regress response on all preds except preds of interest
less_mp_mod = lm(Trnsfr ~ . - MP, data = sub)
less_min_mod = lm(Trnsfr ~ . - Min, data = sub)
# regress preds of interest against all preds
mp_mod = lm(MP ~ . - Trnsfr, data = sub)
min_mod = lm(Min ~ . - Trnsfr, data = sub)
# corr of response and MP
cor(resid(less_mp_mod), resid(mp_mod))
## [1] -0.1946723
# corr of response and Min
cor(resid(less_min_mod), resid(min_mod))
## [1] 0.1007477
Min
# drop Min from the subset
sub = subset(sub, select = -Min)
# without 90s, xG, Gls, npxG, npxG+xAG, Min
model6 = lm(formula = Trnsfr ~ Age + MP + Ast + `G-PK` +
PKatt + PrgP + PrgR + Six, data = numeric_data)
vif(model6)
## Age MP Ast `G-PK` PKatt PrgP PrgR Six
## 1.043371 2.421840 2.671265 2.189610 1.310128 2.467442 2.570758 1.212519
# rename
baseline_mod = model6
Although this code may look overcomplicated, the procedure it implements is simple and intuitive.
The entire process goes like this:
The initial subset contains all of the variables selected from
the BIC and AIC search process along with the response variable
Trnsfr
We begin eliminating variables using the following procedure:
Find two highly correlated variables (e.g., Min and
90s)
Find their partial correlations with the response
Trnsfr
Remove the variable that is less correlated with the response from the subset
Fit the model with all variables in the updated subset
Check the VIF
Subset the dataset again and repeat the process until the VIF is below 5 for every variable in the model
We fit our resulting model and use it as the baseline model for
the remainder of the project; this is baseline_mod
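The residual-on-residual correlations computed at each elimination step are partial correlations: by the Frisch–Waugh–Lovell theorem, correlating the residuals from the two auxiliary regressions measures the association between the response and a candidate predictor after the linear effect of every other predictor has been removed:

```latex
r_{\,Y X_j \mid X_{-j}} \;=\; \operatorname{corr}\!\big(e_{\,Y \mid X_{-j}},\; e_{\,X_j \mid X_{-j}}\big)
```

where e_{Y|X_{-j}} and e_{X_j|X_{-j}} are the residuals from regressing Y and X_j, respectively, on the remaining predictors.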
# generate fitted and residual values and store in dataframe
base_assump_df = data.frame(fitted = fitted(baseline_mod),
residuals = resid(baseline_mod))
# fitted vs. residuals plot for baseline model
ggplot(base_assump_df, aes(x = fitted, y = residuals)) +
geom_point(color = "dodgerblue", shape = 16) +
geom_hline(yintercept = 0, color = "orange", size = 1) + # horizontal line at 0
labs(x = "Fitted", y = "Residuals") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
base_assump_df is a dataframe housing the fitted
values and the residuals of baseline_mod, which are
generated using the fitted() and resid()
functions
ggplot2 is used to graph the Fitted vs. Residuals
plot
geom_hline is used to add a horizontal line at 0 for
easier interpretation
# breusch-pagan test
baseline_bp_test = bptest(baseline_mod)
# test statistic
baseline_bp_test_stat = baseline_bp_test$statistic
# p-value
baseline_bp_pval = baseline_bp_test$p.value
# print
print(paste("The Test Statistic is:", baseline_bp_test_stat))
print(paste("The p-value is:", baseline_bp_pval))
The bptest() function from the lmtest
library is used to perform the Breusch-Pagan Test
baseline_bp_test_stat and
baseline_bp_pval hold the test statistic and
p-value
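For reference, the Breusch-Pagan statistic is built from an auxiliary regression of the squared residuals on the model predictors:

```latex
\mathrm{LM} = n R^2_{\text{aux}} \;\sim\; \chi^2_{q} \quad \text{under } H_0\!:\ \text{constant variance}
```

where R²_aux comes from regressing ê² on the q predictors; a small p-value indicates heteroscedasticity.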
# normal q-q plot
qqnorm(resid(baseline_mod),
main = "Normal Q-Q Plot: Final Baseline Model",
col = "dodgerblue")
# add q-q line
qqline(resid(baseline_mod), col = "orange", lwd = 2)
# grid-lines
grid()
qqnorm() is used to generate the Normal Q-Q
plot
qqline() adds the Q-Q line to the plot
# shapiro-wilk test
baseline_sw_test = shapiro.test(resid(baseline_mod))
# p-value
baseline_sw_test_pval = baseline_sw_test$p.value
# print
print(paste("The p-value is:", baseline_sw_test_pval))
shapiro.test() is used to perform the Shapiro-Wilk
Test
baseline_sw_test_pval holds the p-value from the
test
# baseline assumption results dataframe
baseline_assumption_df = data.frame(Result = c("Passed", "Failed", "Failed"))
rownames(baseline_assumption_df) = c("Linearity Assumption",
"Homoscedasticity Assumption",
"Normality Assumption")
kable(baseline_assumption_df, align = "r")
baseline_assumption_df holds the results for the
assumption checking in a kable() table
# add player names back to data
numeric_data = bind_cols(numeric_data, soccer["Player"])
# calculate cook's distances
distances = cooks.distance(baseline_mod)
# identify influential points using heuristic
infl_points = which(distances > 4 / length(distances))
# subset for influential players
infl_players = numeric_data[infl_points, ]
# neat table
kable(infl_players["Player"],
booktabs = TRUE, align = "r")
bind_cols() is used to add the Player
variable back to numeric_data
Cook’s distances are calculated using the function
cooks.distance()
which() is used to identify the influential
points
infl_players subsets numeric_data to
obtain the names of the influential players
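Cook's distance combines the size of a point's residual with its leverage:

```latex
D_i = \frac{e_i^2}{p\,\hat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}
```

where e_i is the residual, h_ii the leverage, and p the number of estimated coefficients; the 4/n cutoff used above is a common rule of thumb rather than a formal test.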
# indexing for non-influential points only
non_infl_data = numeric_data[-infl_points, ]
# fitting no influential points model
non_infl_model = lm(formula = Trnsfr ~ Age + MP + Ast + `G-PK` + PKatt + PrgP +
PrgR + Six, data = non_infl_data)
non_infl_data is a subset of
numeric_data containing only non-influential
players
non_infl_model is the baseline model fit with only
non-influential data points
# bp test for model with no influential points
non_infl_model_bp_test = bptest(non_infl_model)
# print
print(paste("The test statistic is:", non_infl_model_bp_test$statistic))
print(paste("The p-value is: ", non_infl_model_bp_test$p.value))
The bptest() function is used to test the constant variance
assumption for the non-influential model
# generate fitted and residual values and store in dataframe
imp_assume_df = data.frame(fitted = fitted(non_infl_model),
residuals = resid(non_infl_model))
# fitted vs. residuals plot for baseline model
ggplot(imp_assume_df, aes(x = fitted, y = residuals)) +
geom_point(color = "dodgerblue", shape = 16) +
geom_hline(yintercept = 0, color = "orange", size = 1) + # horizontal line at 0
labs(x = "Fitted", y = "Residuals",
title = "Fitted vs. Residuals: Non-Influential Model") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
imp_assume_df contains the fitted values and
residuals for the non-influential model
ggplot2 is used to generate the plot
# fit log model
log_mod = lm(formula = log(Trnsfr) ~ Age + MP + Ast + `G-PK` + PKatt + PrgP +
PrgR + Six, data = non_infl_data)
# bp test for log model
log_mod_bp = bptest(log_mod)
# results from bp test
print(paste("The BP-Test test statistic is:", log_mod_bp$statistic))
print(paste("The BP-Test p-value is:", log_mod_bp$p.value))
# sw test for log model
log_mod_sw = shapiro.test(resid(log_mod))
# results from sw test
print(paste("The Shapiro-Wilk p-value is :", log_mod_sw$p.value))
log(Trnsfr) takes the logarithm of the
response
bptest() is used to formally test for constant
variances
shapiro.test() is used to formally test for
normality
# fit sqrt model
sqrt_mod = lm(formula = sqrt(Trnsfr) ~ Age + MP + Ast + `G-PK` + PKatt + PrgP +
PrgR + Six, data = non_infl_data)
# bp test for sqrt model
sqrt_mod_bp = bptest(sqrt_mod)
# results from bp test
print(paste("The BP-Test test statistic is:", sqrt_mod_bp$statistic))
print(paste("The BP-Test p-value is:", sqrt_mod_bp$p.value))
# sw test for sqrt model
sqrt_mod_sw = shapiro.test(resid(sqrt_mod))
# results from sw test
print(paste("The Shapiro-Wilk p-value is :", sqrt_mod_sw$p.value))
sqrt(Trnsfr) is the square root transformation of
the response
Again bptest() and shapiro.test() are
used to formally check the assumptions
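The square-root transformation is a variance-stabilizing choice: if the variance of the response grows roughly in proportion to its mean, a first-order (delta method) approximation gives

```latex
\operatorname{Var}(Y) \approx c\,\mu \;\Longrightarrow\; \operatorname{Var}\big(\sqrt{Y}\big) \approx \frac{c}{4}
```

so the transformed response has approximately constant variance, consistent with the improved Breusch-Pagan result.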
# normal q-q plot
qqnorm(resid(sqrt_mod), main = "Normal Q-Q Plot: Square Root Model", col = "dodgerblue")
# add q-q line
qqline(resid(sqrt_mod), col = "orange", lwd = 2)
# grid-lines
grid()
qqnorm() plots the Normal Q-Q plot for square root
model
qqline() adds the q-q line
sqrt_assumption_df = data.frame(Result = c("Passed", "Passed", "Passed (Uncertain)"))
rownames(sqrt_assumption_df) = c("Linearity Assumption",
"Homoscedasticity Assumption",
"Normality Assumption")
kable(sqrt_assumption_df, align = "r")
# rename model
final_mod = non_infl_model
# compute robust standard errors
robust_se = sqrt(diag(vcovHC(final_mod, type = "HC3")))
# compute robust test statistics
robust_t_stats = coef(final_mod) / robust_se
# compute robust p-values
robust_p_values = 2 * pt(-abs(robust_t_stats), df = df.residual(final_mod))
# coefficients
final_coefs = summary(final_mod)$coefficients[, "Estimate"]
# regular test statistics
t_stats = summary(final_mod)$coefficients[, "t value"]
# regular p-values
p_values = summary(final_mod)$coefficients[, "Pr(>|t|)"]
# results dataframe
result_df = data.frame(coef = round(final_coefs, 5),
t = round(t_stats, 5),
pv = p_values,
rob_t = round(robust_t_stats, 5),
rob_pv = robust_p_values)
colnames(result_df) = c("Estimated Coefficient",
"Test Statistic",
"P-Value",
"Robust Test Statistic",
"Robust P-Value")
# neat table
kable(result_df, booktabs = TRUE, align = "r")
sqrt(diag(vcovHC(final_mod, type = "HC3")))
calculates the robust standard errors
Robust test statistics are computed using
coef(final_mod) / robust_se
Robust p-values are computed using
2 * pt(-abs(robust_t_stats), df = df.residual(final_mod))
final_coefs, t_stats, and
p_values contain the estimates, regular test statistics,
and the regular p-values from the original regression summary for
final_mod
result_df is a dataframe containing the
results
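For reference, HC3 is a sandwich covariance estimator with leverage-adjusted squared residuals on the diagonal:

```latex
\widehat{\operatorname{Var}}_{\mathrm{HC3}}(\hat{\beta}) = (X^\top X)^{-1} X^\top \operatorname{diag}\!\left(\frac{e_i^2}{(1 - h_{ii})^2}\right) X\, (X^\top X)^{-1}
```

The (1 − h_ii)² adjustment inflates the contribution of high-leverage points, making HC3 more conservative than HC0–HC2 in small samples.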
# fit nested model
final_less_prgr = lm(formula = Trnsfr ~ Age + MP + Ast + `G-PK` + PKatt + PrgP + Six, data = non_infl_data)
# anova F-Test
prgr_test = anova(final_less_prgr, final_mod)
paste("The p-value is:", prgr_test$`Pr(>F)`[2])
final_less_prgr is the nested model without PrgR
anova() is used to test whether PrgR is
needed in the model
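The test compares the residual sums of squares of the nested and full models:

```latex
F = \frac{(\mathrm{RSS}_{\text{nested}} - \mathrm{RSS}_{\text{full}})/q}{\mathrm{RSS}_{\text{full}}/(n - p)}
```

where q is the number of dropped predictors (here 1) and p the number of coefficients in the full model; under the null hypothesis that the dropped predictor adds nothing, F follows an F(q, n − p) distribution.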
print(paste("The R-Squared is:", summary(final_mod)$r.squared))
print(paste("The Adjusted R-Squared is:", summary(final_mod)$adj.r.squared))
summary(final_mod)$r.squared obtains the R-Squared
of the regression
summary(final_mod)$adj.r.squared obtains the
Adjusted R-Squared of the regression
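Adjusted R² penalizes R² for the number of fitted coefficients:

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p}
```

where n is the number of observations used to fit final_mod and p the number of estimated coefficients, including the intercept.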
f = summary(final_mod)$fstatistic
p = pf(q = f[1], df1 = f[2], df2 = f[3], lower.tail = FALSE)
print(paste("The Significance of Regression Test Statistic is:", f[1]))
print(paste("The P-Value is:", p))
f stores the test statistic and the degrees of
freedom
p computes the p-value for the test statistic on
its respective degrees of freedom
Six
The feature-engineered variable Six exhibits a moderate
linear correlation of 0.52 with the target variable Trnsfr,
suggesting that Six explains a significant portion of the
variation in Trnsfr. This implies that including
Six in the model could improve its predictive accuracy.
PrgR, PrgP, PrgC
The progressive metrics show moderate correlations with
Trnsfr, with all of the metrics having around 0.50
correlation with Trnsfr. This indicates that progressive
actions, such as passes, carries, and receptions, play a significant
role in determining a player’s market value. Including these variables
could improve the model’s ability to capture the impact of player
involvement in advancing play.
CrdY, CrdR
Although CrdY and CrdR are not strongly
correlated with Trnsfr, their inclusion in the model
provides a reflection of a player’s discipline and mentality. This
insight could be particularly useful in cases where clubs value player
temperament as part of their transfer decision-making process.
The actual performance metrics, such as Gls,
Ast, and others, are highly correlated with their expected
counterparts (xG, xAG, etc.). This is expected
since the expected metrics are derived from underlying performance data
and probabilities that align closely with actual outcomes. While both
sets of variables provide valuable insights, their high correlation
could lead to multicollinearity. To avoid inflating variance, we’ll need
to carefully consider which set of metrics to include in the model.
Born and Age are almost perfectly
negatively correlated (-0.99), as they are inverse representations of
the same concept. Including both variables would be redundant, so we
should retain Age as it offers a more intuitive
understanding of player maturity and experience.
Game time metrics such as Min, Starts,
90s, and MP are highly correlated, as expected
because they measure different aspects of playing time. Including all of
them could lead to multicollinearity issues. Therefore, it would be
beneficial to select one or two representative metrics to ensure the
variance is not inflated and the model remains interpretable.
Age vs. Trnsfr
Age has a quadratic relationship with Trnsfr,
which aligns with the trend that younger and older players generally
have lower market values. This is because young players are in their
developmental stage while older players are potentially regressing.
Including Age as a quadratic term in the model would
capture this relationship, allowing the model to better reflect the
influence of age on transfer value.
Nearly all performance metrics including actual (e.g.,
Gls), expected (e.g., xG), and progressive
(e.g., PrgC) exhibit signs of high variance. This suggests
that logging these variables could be beneficial. By applying a
logarithmic transformation, we can compress extreme values, reducing the
impact of outliers and stabilizing the variance. This transformation
also makes relationships more linear and interpretable, which improves
the model’s performance and robustness by ensuring more consistent
coefficient estimates.
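To illustrate the variance-compressing effect, here is a minimal R sketch using simulated right-skewed counts; the data and the variable name standing in for PrgC are invented for illustration, not taken from the real dataset.

```r
# A minimal sketch of why logging helps: simulated right-skewed counts
# standing in for a progressive metric such as PrgC (not the real data).
set.seed(1)
prgc     <- rexp(200, rate = 1/40)   # heavy right tail, like progressive carries
log_prgc <- log1p(prgc)              # log(1 + x) is safe when values can be 0

# The log scale compresses extreme values, shrinking the spread
c(raw_sd = sd(prgc), log_sd = sd(log_prgc))
```

Using `log1p()` rather than `log()` is a common choice here because count metrics can legitimately be zero.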
Pos

Pos indicates whether a player is a midfielder (0) or an
attacker (1).
The highest market value belongs to an attacker (Erling Haaland), which is unsurprising given the premium placed on elite goal-scorers in the transfer market.
The median Trnsfr value across positions is relatively
similar, indicating that positional differences do not drastically
affect median valuations.
However, the spread of Trnsfr values for forwards is
notably high, reflecting a wider range of market values for attackers.
This is likely driven by varying goal-scoring capabilities and
demand.
The distribution of players by position is relatively even, with attackers making up 43% of the sample and midfielders comprising the rest. This balance allows for fair comparison across positions in the model.
Six

Six indicates whether a player plays for a Big Six club
(1) or does not play for a Big Six club (0).
The boxplot reveals a significant discrepancy in median
Trnsfr values, with Big Six players having substantially
higher median market values compared to players from other clubs. This
discrepancy is logical, given the stronger financial backing and higher
performance expectations at these clubs.
The variation within the Big Six group is high, as reflected by the wider spread in their boxplot. This is partly due to their smaller sample size, comprising only about 30% of the dataset, and the diverse market valuations of players across these top-tier teams.
Nation

Nation is a factor with levels indicating the
nationality of a player.
English players are the most common nationality, followed by players from major soccer nations such as Brazil and France. This reflects the Premier League’s strong domestic influence and its international appeal, particularly from historically strong soccer countries.
When summing Trnsfr by nationality, English players have
the highest total valuation. This likely reflects the “English premium”,
where domestic players often command higher transfer fees due to league
regulations, fan appeal, and homegrown status requirements.
| | Mean | StDev | Median | Trimmed Mean | Min | Max | Range |
|---|---|---|---|---|---|---|---|
| Pos | 0.43154 | 0.49632 | 0.0 | 0.41451 | 0.0 | 1.0 | 1.0 |
| Age | 24.94606 | 3.82878 | 25.0 | 24.80829 | 17.0 | 35.0 | 18.0 |
| Born | 1997.68050 | 3.84513 | 1998.0 | 1997.81347 | 1988.0 | 2006.0 | 18.0 |
| MP | 26.04149 | 8.20711 | 28.0 | 26.47150 | 10.0 | 38.0 | 28.0 |
| Starts | 17.90041 | 10.64652 | 17.0 | 17.80311 | 0.0 | 37.0 | 37.0 |
| Min | 1596.23651 | 874.91197 | 1495.0 | 1581.51813 | 79.0 | 3325.0 | 3246.0 |
| 90s | 17.73361 | 9.71976 | 16.6 | 17.57098 | 0.9 | 36.9 | 36.0 |
| Gls | 4.21577 | 4.73233 | 3.0 | 3.37306 | 0.0 | 27.0 | 27.0 |
| Ast | 2.66805 | 2.92079 | 2.0 | 2.17617 | 0.0 | 13.0 | 13.0 |
| G+A | 6.88382 | 6.79545 | 5.0 | 5.86528 | 0.0 | 33.0 | 33.0 |
| G-PK | 3.81743 | 4.11500 | 2.0 | 3.10881 | 0.0 | 20.0 | 20.0 |
| PK | 0.39834 | 1.17218 | 0.0 | 0.08808 | 0.0 | 9.0 | 9.0 |
| PKatt | 0.44398 | 1.29018 | 0.0 | 0.10363 | 0.0 | 9.0 | 9.0 |
| CrdY | 3.80498 | 3.07668 | 3.0 | 3.49741 | 0.0 | 13.0 | 13.0 |
| CrdR | 0.10373 | 0.33170 | 0.0 | 0.00000 | 0.0 | 2.0 | 2.0 |
| xG | 4.19129 | 4.39114 | 2.7 | 3.40829 | 0.0 | 29.2 | 29.2 |
| npxG | 3.84025 | 3.76456 | 2.7 | 3.20466 | 0.0 | 22.9 | 22.9 |
| xAG | 2.70000 | 2.55852 | 1.9 | 2.27876 | 0.0 | 11.8 | 11.8 |
| npxG+xAG | 6.54315 | 5.58418 | 5.2 | 5.72383 | 0.1 | 27.3 | 27.2 |
| PrgC | 41.29046 | 37.28894 | 29.0 | 35.70984 | 0.0 | 218.0 | 218.0 |
| PrgP | 72.03734 | 63.04782 | 56.0 | 62.52850 | 2.0 | 376.0 | 374.0 |
| PrgR | 90.51867 | 81.27192 | 64.0 | 78.54922 | 1.0 | 508.0 | 507.0 |
| Trnsfr | 28.71846 | 27.25455 | 20.0 | 24.50518 | 0.4 | 180.0 | 179.6 |
| Six | 0.29876 | 0.45866 | 0.0 | 0.24870 | 0.0 | 1.0 | 1.0 |
The average age is 25, aligning with the typical peak of a player’s physical and athletic prime. This age distribution makes sense, as clubs often prioritize players in their prime for both performance and transfer value.
On average, players participate in 26 matches and start 18, indicating a considerable level of rotation within teams. This could reflect squad depth, tactical flexibility, or player fitness management over the season.
The trimmed mean for Gls highlights how significantly
Erling Haaland skews the average, given his exceptional goal-scoring
record. By trimming extreme values, the mean offers a more balanced
reflection of typical goal tallies across players, mitigating the
influence of outliers like Haaland.
Similarly, expected and progressive metrics are heavily influenced by outliers. Players with exceptionally high values in these categories can distort averages, emphasizing the need for strategies such as logging or trimming to handle variance and maintain model reliability.
Players receive an average of roughly four yellow cards per season, although one player received 13, which demonstrates an extremely aggressive playstyle. This notable outlier indicates a disciplinary concern and could impact a player’s market perception if such tendencies are considered detrimental.
The maximum number of red cards is 2, suggesting that extreme disciplinary actions are relatively rare. However, it is still impactful enough to consider when evaluating a player’s discipline or risk factor.
Approximately 29.8% of players belong to a Big Six club. This highlights the Premier League’s concentration of top talent within a select group of elite clubs, which likely contributes to their disproportionate impact on market valuations.
The average transfer value is £28.7M, with a trimmed mean of £24.5M. The maximum value, £180M, once again reflects Haaland’s market dominance. This significant gap suggests that a small number of high-value players could be inflating the overall average.
Trnsfr Findings

The histogram of Trnsfr reveals a right-skewed
distribution with noticeable outliers on the far right, representing
players with exceptionally high transfer market values. These outliers
significantly inflate the mean and can introduce skewness into the
model. This can affect the accuracy and reliability of the model.
Based on the findings from our data analysis, we are ready to proceed
with building a series of regression models to understand
Trnsfr (transfer market value). Our approach will be
systematic, first establishing a baseline model then determining if the
regression assumptions hold. If they do not, we consider other
alternatives such as removing influential points or using robust test
statistics.
Here is the plan:
We will first establish a baseline model. To do this, we will:
Review correlations
Address multicollinearity
Once we have our baseline model, we will validate the regression assumptions to ensure robustness:
The regression assumptions are:
Linearity
Homoscedasticity (constant variance)
Normality of errors (this is really only important for inference, and even then, we should always compute robust test statistics)
Once we determine which tests the model passes and fails, we attempt to make the model more robust.
To present the baseline model, we will first outline the search procedure.
The search procedure begins by fitting a full additive model:
\[ \text{Trnsfr} = X^T\beta \]
Where:
\(\text{Trnsfr}\) is the response variable
\(X^T\) is the transpose of the design matrix \(X\), which contains a column of ones for the intercept and all of the variables
\(\beta\) is the vector of coefficients corresponding to the variables in \(X\)
We then use AIC and BIC backwards and exhaustive searches to select
our model. To do this in R, we use the step()
function. In order to understand the step() function, we
must first understand AIC and BIC.
The Akaike Information Criterion (AIC) is used for model variable selection by balancing model fit and complexity. The formula is:
\[ AIC = -2 \log L(\hat{\beta}, \hat{\sigma}^2) + 2p = n + n \log(2\pi) + n \log\left(\frac{RSS}{n}\right) + 2p \]
Where:
\(L(\hat{\beta}, \hat{\sigma}^2)\) is the likelihood function evaluated at the estimated parameters
\(p\) is the number of parameters in the model
\(RSS\) is the residual sum of squares
\(n\) is the sample size
The penalty term \(2p\) grows with the number of parameters, and since we are searching for a low AIC, an extra parameter is only worthwhile if it improves the fit enough to offset the penalty.
Therefore, a good model (one with low AIC) will balance between fitting well and using a small number of parameters.
The Bayesian Information Criterion (BIC) is very similar to the AIC except that it has a higher penalty term. The formula is:
\[ BIC = -2 \log L(\hat{\beta}, \hat{\sigma}^2) + \log(n) \cdot p = n + n \log(2\pi) + n \log\left(\frac{RSS}{n}\right) + \log(n) \cdot p \]
The penalty term in this case is \(\log(n)
\cdot p\), which is why we use k = log(n) when using
BIC search for the step() function.
Similarly, a good model (one with low BIC) will balance between fitting well and using a small number of parameters. However, BIC tends to be stricter and selects fewer parameters than AIC due to its higher penalty term.
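Both criteria are requested from `step()` through its `k` argument (`k = 2` for AIC, `k = log(n)` for BIC). Here is a hedged sketch on simulated data; the data frame below is invented for illustration and is not the real player dataset.

```r
# Sketch: backward AIC and BIC searches with step() on simulated data.
# The predictors and response here are invented for illustration only.
set.seed(5110345)
n  <- 241
df <- data.frame(Age = rnorm(n, 25, 4),
                 MP  = rpois(n, 26),
                 Six = rbinom(n, 1, 0.3))
df$Trnsfr <- 45 - 1.4 * df$Age + 0.4 * df$MP + 16 * df$Six + rnorm(n, 0, 10)

full    <- lm(Trnsfr ~ ., data = df)                                  # full additive model
aic_mod <- step(full, direction = "backward", trace = 0)              # default k = 2 (AIC)
bic_mod <- step(full, direction = "backward", trace = 0, k = log(n))  # BIC penalty
```

Because \(\log(241) \approx 5.5 > 2\), the BIC run charges more per parameter and therefore tends to retain fewer variables.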
For the backward step-wise procedure, the step()
function starts with the full additive model specified above. Then,
R iteratively removes one variable at a time, recalculating
the AIC at each step. The process continues until removing any further
variables no longer improves the AIC. The model with the lowest (best)
AIC is returned as the final model.
The same follows for the BIC.
For the exhaustive procedure, all possible combinations of variables are evaluated, considering models of every size, and the AIC is recalculated for every candidate. Once all combinations have been assessed, the model with the lowest (best) AIC is returned as the final model. Note that step() itself only performs stepwise moves; an exhaustive search over all subsets is typically carried out with a function such as regsubsets() from the leaps package.
The same follows for the BIC.
Interestingly, we obtain several models from our searches. The BIC
searches return models with low VIF (Variance Inflation Factors),
leaving little collinearity to address, so we settle for the AIC
backward and exhaustive search models. Note there is no “correct” model,
and that the choice truly depends on the types of questions one wishes
to answer. In this case, we want to address high collinearity amongst
variables such as Gls and xG because that makes for
a more interesting project! Hence, the chosen model is:
\[ \text{Trnsfr} = \beta_0 + \beta_1\text{Age} + \beta_2\text{MP} + \beta_3\text{Min} + \beta_4\text{90s} + \beta_5\text{Gls} + \beta_6\text{Ast} + \beta_7\text{G-PK} + \beta_8\text{PKatt} + \beta_9\text{xG} + \beta_{10}\text{npxG+xAG} + \beta_{11}\text{PrgP} + \beta_{12}\text{PrgR} + \beta_{13}\text{Six} + \epsilon_i \] Where \(\epsilon_i\) is the error.
From our exploratory data analysis (EDA), it is evident that certain predictor variables exhibit high correlations. Specifically:
MP, Min, and 90s are all
measures of playing time and are naturally correlated
G-PK, PKatt, xG, and
npxG represent performance metrics and are
interrelated
High collinearity does not prevent model estimation (unlike perfect collinearity) but it still affects the model in negative ways:
Unreliable Standard Errors: The standard errors of the variables are inflated, making it harder to assess the statistical significance of variables
Unstable Coefficients: The estimated coefficients are highly unstable, where small changes in data cause large swings in estimated coefficients
Note that high collinearity does not necessarily impact the
predictive accuracy of our model; we can still generate reliable
predictions despite its presence. However, prediction is not the primary
goal of this analysis. Our objective is to model and interpret the
relationships between the variables and the response variable
Trnsfr.
When interpretation is the focus, high collinearity becomes
problematic because it obscures the individual contribution of each
variable. For example, while our model may suggest that playing time and
performance are significant drivers of transfer value, we cannot clearly
determine the extent to which Min, MP, or
90s independently influence Trnsfr due to
their high collinearity.
Thus, reducing high collinearity will improve the interpretability of our coefficients while maintaining the model’s robustness. This ensures that any inferences drawn from the model are both statistically reliable and practically meaningful.
To identify high collinearity in the model, we will use the Variance Inflation Factor (VIF) for each variable. The formula for the VIF is as follows:
\[ \text{VIF}_i = \frac{1}{1 - R_i^2} \] Where \(R_i^2\) is the coefficient of determination obtained by regressing the \(i\)-th variable on all the other variables in the model. In other words, \(R_i^2\) represents the proportion of variance in the \(i\)-th variable that can be explained by the remaining variables. A high \(R_i^2\) suggests that the \(i\)-th variable is highly collinear with others, which leads to a high VIF.
The VIF can be interpreted (loosely because there is no set rule) as follows:
\(VIF_i < 5\) is generally acceptable
\(5 \le VIF_i < 10\) is often still considered acceptable, but we will opt for the stricter \(VIF_i < 5\) threshold
\(VIF_i > 10\) suggests high collinearity
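The definition above can be computed directly by regressing each predictor on all the others. A small sketch on simulated data follows; the variables are stand-ins, with `Nineties` built to be nearly collinear with `Min` to mimic the real 90s variable.

```r
# Sketch: VIF_i = 1 / (1 - R_i^2), computed by regressing each predictor
# on all the others. The data is simulated, not the player dataset.
set.seed(42)
n   <- 200
Min <- rnorm(n, 1600, 875)
X   <- data.frame(Min      = Min,
                  Nineties = Min / 90 + rnorm(n, 0, 0.1),  # near-perfect collinearity
                  Age      = rnorm(n, 25, 4))

vif <- sapply(names(X), function(v) {
  # R_i^2 from regressing variable v on the remaining predictors
  r2 <- summary(lm(reformulate(setdiff(names(X), v), v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif, 2)  # Min and Nineties blow up; Age stays near 1
```

In practice the same numbers come from `car::vif()` on the fitted model, but the manual version makes the formula transparent.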
| | VIF |
|---|---|
| Age | 1.09407 |
| MP | 4.79264 |
| Min | 118351.60795 |
| 90s | 118324.97937 |
| Gls | 907.29183 |
| Ast | 4.98864 |
| G-PK | 711.24471 |
| PKatt | 3769.59091 |
| xG | 62074.49639 |
| npxG | 45479.37893 |
| npxG+xAG | 36.93123 |
| PrgP | 4.21782 |
| PrgR | 3.71322 |
| Six | 1.37001 |
Using the model obtained from the step-wise selection procedure, we
calculate the VIF for all variables. We see that Min,
90s, Gls, G-PK,
PKatt, xG, npxG and
npxG+xAG have exceedingly high VIF values, suggesting
extremely strong collinearity with other variables in the model (as
expected from our EDA). These variables will be the primary focus for
addressing high collinearity.
We proceed by iteratively removing variables based on their partial
correlation coefficients with the response variable Trnsfr.
The process is as follows:
Identify two highly correlated variables with high VIF values
Compute the partial correlations of each variable with
Trnsfr while controlling for all other variables
The partial correlation is computed as follows:
Regress the response Trnsfr on all variables except
variable of interest
Regress variable of interest on all variables except the response
Compute the correlation of the residuals of the models to obtain the partial correlation
Drop the variable with a weaker partial correlation with
Trnsfr
Refit the model and recalculate VIF after each drop
Repeat until all VIF values are below 5
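The residual-on-residual step of this procedure can be sketched as follows, with simulated data standing in for the real variables (the names below only mirror those in the text).

```r
# Sketch: partial correlation of each candidate with the response,
# controlling for the remaining variables via two residual regressions.
# All data here is simulated for illustration.
set.seed(7)
n      <- 200
ctrl   <- data.frame(Age = rnorm(n, 25, 4))  # the "all other variables" block
Min    <- rnorm(n, 1600, 875)
Nine   <- Min / 90 + rnorm(n, 0, 2)          # collinear playing-time twin of Min
Trnsfr <- 20 + 0.01 * Min + rnorm(n, 0, 10)

partial_cor <- function(y, x, controls) {
  ry <- resid(lm(y ~ ., data = controls))    # response with controls removed
  rx <- resid(lm(x ~ ., data = controls))    # candidate with controls removed
  cor(ry, rx)                                # correlation of the residuals
}

abs(partial_cor(Trnsfr, Min,  ctrl))  # keep the variable with the larger magnitude,
abs(partial_cor(Trnsfr, Nine, ctrl))  # drop the weaker one, then refit and re-check VIF
```

The `ppcor` package automates this, but the two-regression construction is exactly the definition used in the steps above.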
90s

The variables of interest are 90s and Min
because they both measure playing time metrics.
We drop 90s because the magnitude of its partial
correlation coefficient is less than the magnitude of the partial
correlation of Min.
xG

The variables of interest are xG and npxG
because they both measure expected goals scored, except npxG
does not include penalty goals.
We drop xG because the magnitude of its partial
correlation coefficient is less than the magnitude of the partial
correlation of npxG.
Gls

The variables of interest are Gls and G-PK
because they both measure goals scored, except G-PK does not
include penalty goals.
We drop Gls because the magnitude of its partial
correlation coefficient is less than the magnitude of the partial
correlation of G-PK.
npxG

The variables of interest are npxG and
npxG+xAG because, fairly obviously, npxG is
part of npxG+xAG.
We drop npxG because the magnitude of its partial
correlation coefficient is less than the magnitude of the partial
correlation of npxG+xAG.
npxG+xAG

The variables of interest are now G-PK and
npxG+xAG because these still have high VIF values and
G-PK likely directly influences the npxG
metric.
We drop npxG+xAG because the magnitude of its partial
correlation coefficient is less than the magnitude of the partial
correlation of G-PK.
Min

The results after the fifth iteration show that MP does
not appear to have an inflated variance but Min does, and
because they are both playing time metrics we consider them.
We drop Min because the magnitude of its partial
correlation coefficient is less than the magnitude of the partial
correlation of MP.
After dropping 90s, xG, Gls,
npxG, npxG+xAG, and Min, the
final baseline model is:
\[ \text{Trnsfr} = \beta_0 + \beta_1\text{Age} + \beta_2\text{MP} + \beta_3\text{Ast} + \beta_4\text{G-PK} + \beta_5\text{PKatt} + \beta_6\text{PrgP} + \beta_7\text{PrgR} + \beta_8\text{Six} + \epsilon_i \]
| | VIF |
|---|---|
| Age | 1.04337 |
| MP | 2.42184 |
| Ast | 2.67127 |
| G-PK | 2.18961 |
| PKatt | 1.31013 |
| PrgP | 2.46744 |
| PrgR | 2.57076 |
| Six | 1.21252 |
As we can see, the VIF values are now all below our desired threshold so we proceed to assumption validation for the final baseline model.
We are interested in the following assumptions:
Linearity
Homoscedasticity (Constant variance)
Normality of Errors
To check for the linearity assumption, we use a Fitted vs. Residuals plot.
If at any fitted value in the plot, the mean of the residuals is approximately 0, then the linearity assumption holds.
To check for constant variance, we again use the Fitted vs. Residuals plot.
If the spread of the residuals is approximately equal at any fitted value, then the constant variance assumption holds.
We can also check for constant variance using the Breusch-Pagan Test. The null and alternative hypotheses are as follows:
\(H_0\): Homoscedasticity, the errors have constant variance about the true model
\(H_1\): Heteroscedasticity, the errors do not have constant variance about the true model
Hence, if we reject \(H_0\) we conclude that there is heteroscedasticity (non-constant variance) in the model.
To check the normality assumption, we use the Normal Quantile-Quantile plot.
If the points on the plot do not follow a straight line (or approximately straight), then the plot suggests the data does not come from a normal distribution.
We can also check for normality using the Shapiro-Wilk Test. The null and alternative hypotheses are as follows:
\(H_0\): Data is sampled from a normal distribution
\(H_1\): Data is not sampled from a normal distribution
Hence, if we fail to reject \(H_0\), we do not have evidence against normality, and it is plausible that the data comes from a normal distribution.
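Both checks are available in base R: the studentized Breusch-Pagan statistic is \(n R^2\) from an auxiliary regression of the squared residuals on the predictors (which is what `lmtest::bptest()` computes), and `shapiro.test()` handles the normality check. A sketch on simulated heteroscedastic data:

```r
# Sketch: Breusch-Pagan (studentized, by hand) and Shapiro-Wilk checks on a
# simulated fit whose error variance deliberately grows with x.
set.seed(3)
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 1 + 0.5 * x)   # non-constant variance on purpose
fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictors: LM = n * R^2
aux  <- lm(resid(fit)^2 ~ x)
bp   <- n * summary(aux)$r.squared
p_bp <- pchisq(bp, df = 1, lower.tail = FALSE)  # df = number of predictors

# Shapiro-Wilk on the residuals (base R)
p_sw <- shapiro.test(resid(fit))$p.value
c(BP_p = p_bp, SW_p = p_sw)
```

With variance growing in `x`, the Breusch-Pagan p-value should be small, correctly flagging heteroscedasticity.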
From the plot, we observe that the majority of the points are scattered near 0 without any discernible pattern, which suggests that the linearity assumption holds for the baseline model. However, there are some larger residuals that are further from 0 which could indicate the presence of outliers or influential points that may be affecting the model fit. Note that even when the linearity assumption is not valid, OLS still generates the best linear approximation (so either way, we are good to go!).
Importantly, the residuals form a funnel pattern where the spread of residuals increases as the fitted values increase. This pattern suggests that the variance of the residuals is likely non-constant, which indicates potential heteroscedasticity (i.e., violation of the assumption of constant variance). To formally test for heteroscedasticity, we will perform the Breusch-Pagan Test. We obtain the following results:
## [1] "The Test Statistic is: 79.8982341662703"
## [1] "The p-value is: 0.0000000000000512495663338689"
From the Breusch-Pagan Test, we obtain a very high test statistic and a p-value close to 0. As a result, we reject the null hypothesis of homoscedasticity at the common significance levels (1%, 5%, 10%) and conclude that there is heteroscedasticity in the model (i.e., the errors do not have constant variance). This means the assumption of constant variance is violated, and any statistical inference (such as confidence intervals or hypothesis tests) made based on the baseline model may not be valid. Note that this issue can be addressed by using heteroscedastic robust test statistics.
The resulting Normal Q-Q plot suggests that the model may not meet the normality assumption. While the points are roughly aligned along the reference line in the middle, there are deviations in the tails, indicating that the residuals may not follow a normal distribution. This could suggest the presence of heavy tails or outliers. To formally assess normality, we conduct the Shapiro-Wilk Test, which will help determine whether the residuals deviate significantly from a normal distribution. Here are the results:
## [1] "The p-value is: 0.0000000806895886292323"
The p-value is small enough to reject the null hypothesis at the 1%, 5%, and 10% significance levels. We conclude that the residuals likely do not follow a normal distribution.
Here are the combined assumption validation results for the baseline model:
| | Result |
|---|---|
| Linearity Assumption | Passed |
| Homoscedasticity Assumption | Failed |
| Normality Assumption | Failed |
We now attempt to improve the validity of the model.
We first search for unusual observations in the data.
As we have seen in the EDA, there are some high-profile players who are skewing the distribution of the response. In order to identify whether these players affect the regression or not, we must determine if they are “influential”.
An influential data point is an outlier that has both high leverage (i.e., a large potential influence on the fitted model) and a large residual. To determine whether a point is influential or not, we use Cook’s Distance. The formula is:
\[ D_i = \frac{1}{p} \cdot \frac{r_i^2 \cdot h_i}{1 - h_i} \]
Where \(D_i\) is Cook’s Distance and \(p\) is the number of parameters in the model.
\(r_i\) is the standardized residual for observation \(i\), calculated as:
\[ r_i = \frac{e_i}{s_e \cdot \sqrt{1 - h_i}}\ \]
Where \(e_i\) is the raw residual for observation \(i\), \(SE(e_i) = s_e \sqrt{1 - h_i}\) is the standard error of the residual, and \(h_i\) is the leverage for observation \(i\).
An observation is considered to have high leverage if \(h_i > 2\bar{h}\), where \(\bar{h}\) is the average leverage.
Note that a Cook’s Distance is considered large if:
\[ D_i > \frac{4}{n} \]
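In R the full computation is available through `cooks.distance()`. Here is a hedged sketch on simulated data with one planted outlier (the real influential players below come from applying the same rule to the fitted baseline model).

```r
# Sketch: flagging influential observations with cooks.distance() and the
# 4/n rule of thumb. The data (and the outlier) are simulated.
set.seed(9)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[1] <- y[1] + 15                     # plant one grossly outlying response
fit <- lm(y ~ x)

d    <- cooks.distance(fit)           # D_i for every observation
infl <- which(d > 4 / n)              # the large-D_i threshold from the text
infl                                  # observation 1 is among those flagged
```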
| Player |
|---|
| Bukayo Saka |
| Declan Rice |
| Gabriel Martinelli |
| Pascal Groß |
| Christopher Nkunku |
| Cole Palmer |
| Nicolas Jackson |
| João Palhinha |
| Mohamed Salah |
| Carlton Morris |
| Erling Haaland |
| Phil Foden |
| Rodri |
| Bruno Fernandes |
| Bruno Guimarães |
| Son Heung-min |
| Pedro Neto |
Using Cook’s Distance we now see the influential players in the dataset and, unsurprisingly, the majority of them are players who are considered world-class (or essentially on the cusp of being world-class).
Here are some reasons for why these players may be influential:
Saka had an exceptional season, recording 16 goals and 9 assists
Rice had a combined 15 goals and assists along with 278
progressive passes (PrgP)
Martinelli recorded 345 progressive passes received
(PrgR)
Groß recorded a notable 10 assists and completed 302 progressive passes
Nkunku’s 19 progressive passes and 35 progressive receptions could be making him influential, although he only featured in 11 matches (which may be another reason)
Palmer had an exceptional season, recording 22 goals and 11 assists
Jackson had a prolific goal scoring record with 14 goals scored
Palhinha recorded 97 progressive passes
Salah had an exceptional season, scoring 18 goals and assisting 10 times
Morris had 15 goals and assists, along with 120 progressive receptions
Haaland had an exceptional season, scoring 27 goals (won the golden boot)
Foden had an exceptional season, scoring 19 goals and assisting 8 times
Rodri had an exceptional season, obtaining a total of 17 goals and assists while also completing 376 progressive passes
Fernandes had 10 goals and 8 assists
Guimarães had 15 total goals and assists while also recording 283 progressive passes
Son Heung-min had an exceptional season, scoring 17 goals and assisting 10 times
Neto had 149 progressive receptions despite only playing 20 games total, and also obtained 9 assists
Given the influential players are anomalies (i.e., a level above the rest of the players) and not representative of the general population of Premier League players, we will exclude these players from the improved model.
This is not necessarily the “correct” decision. However, when we remove the players, we see that the constant variance assumption is satisfied.
## [1] "The test statistic is: 11.2591254003293"
## [1] "The p-value is: 0.187444622505481"
Based on the p-value, we fail to reject the null for the Breusch-Pagan Test and conclude that the errors have constant variance.
We can also check the Fitted vs. Residuals plot for the model with non-influential points.
We see an improvement in the fitted vs. residuals plot as there is no longer a clear funnel shape and the variance looks more equal. This further supports the validity of the Breusch-Pagan Test.
So now our model passes two assumptions:
Linearity
Constant Variance (homoscedasticity)
We will now consider the normality assumption.
Normality in the errors is typically pursued by transforming the response variable, for example with logarithmic or square root transformations. Here, we try both.
We will check the normality assumption, and also use the Breusch-Pagan Test to determine whether the transformation affected the constant variance assumption.
We try the logarithmic transformation first.
## [1] "The BP-Test test statistic is: 19.5715285577208"
## [1] "The BP-Test p-value is: 0.0120845386492388"
## [1] "The Shapiro-Wilk p-value is : 0.00000285680302256805"
We see immediately that a logarithmic transformation does not improve our model’s validity. In fact, using a log transformation causes the constant variance assumption to fail. Furthermore, we reject the null hypothesis for the Shapiro-Wilk Test and conclude again that the data is not normally distributed even with a log transformation.
We try the square root transformation next.
## [1] "The BP-Test test statistic is: 13.1135981012104"
## [1] "The BP-Test p-value is: 0.108000371066453"
## [1] "The Shapiro-Wilk p-value is : 0.0453737582842567"
For the Breusch-Pagan Test, we fail to reject the null at the 1%, 5%, and 10% significance levels, confirming that the square root model satisfies the homoscedasticity assumption.
For the Shapiro-Wilk Test, we obtain a p-value of about 0.0454. The results are as follows:
Fail to reject at the 1% significance level
Reject at the 5% and 10% significance levels
Since we fail to reject the null hypothesis at the stricter 1% level, there is no significant deviation from normality at that threshold. However, the rejection at the 5% and 10% levels indicates that, at less strict significance levels, there is some evidence that the data may not follow a normal distribution (although this could be a Type I error).
We can investigate further by checking the Normal Q-Q plot.
Note that after applying the square root transformation, the deviations in the tails of the Q-Q plot are less pronounced, providing further evidence that the data is more closely approximating a normal distribution. This improvement suggests that the transformation has effectively mitigated the skewness, reinforcing the assumption of normality for the regression analysis.
However, due to the uncertainty raised by the Shapiro-Wilk Test, we proceed by computing both robust test statistics and standard test statistics to ensure a more reliable and comprehensive assessment of the model’s validity.
Here are the combined assumption validation results for the improved model (the square root model).
| | Result |
|---|---|
| Linearity Assumption | Passed |
| Homoscedasticity Assumption | Passed |
| Normality Assumption | Passed (Uncertain) |
We continue by computing robust test statistics.
Also note that by using a square root transformation, the interpretation of the model changes drastically.
For example, consider the model: \(\sqrt{Y} = a + bX\), so that \(Y = (a + bX)^2\)
Then the change in \(Y\) with respect to \(X\) is: \(\frac{dY}{dX} = 2b(a + bX)\)
Although the square root transformation helped resolve the normality problem, we see that the interpretation is challenging even for this simple model. For this reason, we proceed by using the baseline model without influential points and computing robust test statistics for it.
Robust test statistics rely on robust standard errors, which remain valid even if the normality assumption is violated. These standard errors adjust for heteroscedasticity and enhance the reliability of hypothesis tests, particularly when residuals are not homoscedastic.
Violations of normality mainly affect smaller sample sizes, as they can lead to unreliable p-values. However, with large samples, the Central Limit Theorem ensures that the sampling distribution of estimates tends to normality, reducing the impact of non-normal residuals. Given our sample size of 241 observations and 8 independent variables, this is sufficiently large for reliable inference. Hence, even with non-normal residuals, robust standard errors provide reliable results.
The model we will be using is non_infl_mod which is the
baseline model with only non-influential players. This model fails the
normality assumption but satisfies the linearity and constant variance
assumptions. Therefore, it is a great candidate to compute robust
statistics for!
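The robust columns in the table below come from a heteroscedasticity-consistent (sandwich) covariance estimate. The sketch below computes the HC1 version by hand on simulated data, which is what `sandwich::vcovHC()` combined with `lmtest::coeftest()` automates.

```r
# Sketch: heteroscedasticity-robust (HC1) standard errors and test
# statistics computed by hand. The data is simulated, not the player dataset.
set.seed(11)
n <- 241
x <- rnorm(n)
y <- 5 + 2 * x + rnorm(n, sd = 1 + abs(x))    # heteroscedastic errors
fit <- lm(y ~ x)

X    <- model.matrix(fit)
e    <- resid(fit)
XtXi <- solve(crossprod(X))                   # (X'X)^{-1}
meat <- crossprod(X * e)                      # sum_i e_i^2 x_i x_i'
V    <- n / (n - ncol(X)) * XtXi %*% meat %*% XtXi   # HC1 sandwich estimator

robust_se <- sqrt(diag(V))
robust_t  <- coef(fit) / robust_se            # robust test statistics
cbind(estimate = coef(fit), robust_se, robust_t)
```

The small-sample factor \(n/(n-p)\) is the only difference between HC1 and the basic HC0 estimator; either remains valid under heteroscedasticity.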
| | Estimated Coefficient | Test Statistic | P-Value | Robust Test Statistic | Robust P-Value |
|---|---|---|---|---|---|
| (Intercept) | 45.66398 | 8.12527 | 0.0000000 | 8.27192 | 0.0000000 |
| Age | -1.45567 | -7.27040 | 0.0000000 | -7.25703 | 0.0000000 |
| MP | -0.44280 | -2.95427 | 0.0034831 | -2.88619 | 0.0042970 |
| Ast | 1.44378 | 3.39012 | 0.0008311 | 3.11182 | 0.0021117 |
| G-PK | 1.74733 | 5.74892 | 0.0000000 | 5.64104 | 0.0000001 |
| PKatt | 3.57386 | 3.45699 | 0.0006583 | 2.83546 | 0.0050127 |
| PrgP | 0.15234 | 6.88059 | 0.0000000 | 6.19323 | 0.0000000 |
| PrgR | 0.02676 | 1.58630 | 0.1141418 | 1.30105 | 0.1946325 |
| Six | 15.81705 | 8.61450 | 0.0000000 | 7.60311 | 0.0000000 |
We observe that the regular and robust test statistics are very
similar, with both sets of p-values consistently indicating the
significance of most variables. This alignment suggests that despite the
minor concerns regarding the normality assumption, the results from the
robust statistics confirm the validity of the model’s findings. In
particular, the variables Age, MP,
Ast, G-PK, and Six remain
significant, even when accounting for potential deviations from
normality. The fact that both regular and robust statistics produce
similar results indicates that the model is likely robust to normality
violations, reinforcing the strength and reliability of the
conclusions.
From our results, we see that the variable PrgR is not
significant at the 1%, 5%, or 10% significance levels.
We can use an F-Test to determine whether to retain the variable or not.
## [1] "The p-value is: 0.114141797383035"
The p-value indicates that PrgR is likely not needed in
the regression.
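For a single coefficient, the F-statistic is the squared t-statistic, which is why this p-value matches the PrgR t-test p-value reported earlier. The nested-model comparison can be sketched with `anova()` on simulated data (the irrelevant predictor here stands in for the real variable).

```r
# Sketch: nested-model F-test for whether one predictor earns its place.
# x2 is built to be irrelevant; all data is simulated for illustration.
set.seed(13)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)           # x2 plays no role in the true model

full    <- lm(y ~ x1 + x2)
reduced <- lm(y ~ x1)
anova(reduced, full)                  # a large p-value favors dropping x2
```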
## [1] "The R-Squared is: 0.711275295300586"
## [1] "The Adjusted R-Squared is: 0.700532050474562"
We obtain an R-Squared and Adjusted R-Squared of about 0.7-0.71,
meaning about 70-71% of the variation in the response
Trnsfr is explained by the regression.
## [1] "The Significance of Regression Test Statistic is: 66.2067472927337"
## [1] "The P-Value is: 0.00000000000000000000000000000000000000000000000000000799973061319306"
Since the p-value is extremely close to 0, we reject the null that the regression is insignificant. We conclude that the regression is significant (i.e., at LEAST one of the variables in the regression has a significant linear relationship with the response).